Software Vault: The Gold Collection

home *** CD-ROM | disk | FTP | other *** search

/ Software Vault: The Gold Collection / Software Vault - The Gold Collection (American Databankers) (1993).ISO / cdr25 / isp3009b.zip / ENGLISH.AFF < prev next >

Wrap

Text File | 1993-03-04 | 17KB | 429 lines

# # $Id: english.aff,v 1.11 91/09/12 00:01:33 geoff Exp $ # # Copyright 1987, 1988, 1989, by Geoff Kuenning, Manhattan Beach, CA # Permission for non-profit use is hereby granted. # All other rights reserved. # See "version.h" for a more complete copyright notice. # # $Log: english.aff,v $ # Revision 1.11 91/09/12 00:01:33 geoff # Fix the definition of "texchars" to reflect the new requirements, and # fix the comments describing that statement. # # Revision 1.10 91/09/04 18:00:46 geoff # Fix a tiny typo. # # Revision 1.9 91/05/27 21:35:12 geoff # Move the nroffchars and texchars statements to conform with the new # statement ordering restrictions. # # Revision 1.8 89/04/30 17:50:31 geoff # Rewrite the comments (and the character-set declarations) to describe and # conform to the new yacc grammar. # # Revision 1.7 89/02/18 00:51:23 geoff # Add definitions of "nroffchars" and "texchars", primarily to server as # an example. (Ken Stevens) # # Revision 1.6 88/12/26 02:24:03 geoff # Update the copyright notice. # # Revision 1.5 88/02/20 23:10:19 geoff # Add comments about what flags are within use. Fix a cosmetic case error. # # Revision 1.4 87/09/14 22:38:13 geoff # Add copyright comments # # Revision 1.3 87/09/09 00:16:17 geoff # Add comments about how charset statements now control collating order. # # Revision 1.2 87/08/02 17:17:03 geoff # Add three prefixes (A: re, I: in, and U: un), cross-product flags, and # extensive commentary roughly documenting the table format, in case this # file gets distributed by itself. # # Revision 1.1 87/06/28 23:13:54 geoff # Initial revision # # # Affix table for English # # Brief description of table syntax (* in syntax descriptions means one # or more occurrences; square brackets and braces have their usual meaning): # # Although the grammar is designed for a line-oriented layout, it is actually # a free-format yacc grammar and can be laid out weirdly if you want. # Comments are started by "#" and continue to the end of the line. # Backslashes are supported in the usual fashion (\nnn, plus specials # \n, \r, \t, \v, \f, \b, and the new hex format \xnn). Any character # with special meaning to the parser can be changed to an uninterpreted # token by backslashing it; for example you can declare a flag named # 'asterisk' or 'colon' with "flag \*:" or "flag \::". # # table : [headers] [prefixes] [suffixes] # # One of "prefixes" and "suffixes" is required. They can appear in either # order. # # headers : {options | char-sets }* # # The headers contain global descriptions and options. These are used to # describe the character sets used by the language, the characters that have # special meaning to the formatter, and the defaults for certain ispell # options. # # options : { fmtr-stmt | bool-stmt | flag-stmt } # # fmtr-stmt : { nroff-stmt | tex-stmt } # # A fmtr-stmt describes characters that have special meaning to a formatter. # Normally, this statement is not necessary, but some languages may have # preempted the usual defaults for use as language-specific characters. In # this case, these statements may be used to redefine the special characters # expected by the formatter. # # nroff-stmt : "nroffchars" string # # The nroff-stmt describes the characters that have special meaning to the # nroff/troff family of formatters. A string of exactly five characters must # be given. The default string is "().\*" and refers to the meanings of those # five characters in an unmodified nroff. In other words, the backslash is # used to introduce special formatter sequences, the asterisk is used to # introduce string substitutions, the left parenthesis is used after a backslash # to introduce two-character symbol names, and the period is used at the # beginning of a line to introduce formatter commands. (The right parenthesis # is not currently used, but is included for completeness.) # # tex-stmt : "texchars" string # # The tex-stmt is similarly used to describe characters that have special # meaning to the TeX and LaTeX formatters. A string of exactly 13 characters # must be given. The default string is "()[]{}<>\$*.%" and refers to the # meanings of those characters in an unmodified TeX processor. (The # dot, square brackets, and angle brackets also apply to the "tib" # bibliographic processor.) # # bool-stmt : { "compoundwords" on-or-off | "allaffixes" on-or-off } # # on-or-off : { "on" | "off" } # # A bool-stmt controls certain ispell defaults that are best made # language-specific. The "compoundwords" statement controls the default # for the "-B" and "-C" options, which control whether run-together words # are acceptable spellings. "Compoundwords" should usually be turned on # for languages such as German, where legal words may be formed from the # concatenation of shorter words. The default is "off". # # The "allaffixes" statement controls the default for the "-P" and "-m" # options, which control whether ispell will suggest possible # affix derivations for misspelled words even if possible corrections # are found in the dictionary. This option should usually be turned on # when you are developing the dictionary for a language, and off for # languages that have fairly complete dictionaries. The default is "off". # # flag-stmt : "flagmarker" character # # The flag-stmt describes the character which is used to separate affix # flags from the root word in a raw dictionary file. This must be a # character which is not found in any word (including in string characters; # see below). The default is "/" because this character is not normally # used to represent special characters in any language. # # char-sets : { char-stmt | string-stmt } # # The character-set section describes the characters that can be part of # a word. A char-stmt describes single characters; a string-stmt # describes characters that must appear together as a string. Either may # also describe conversion between upper and lower case. # # char-stmt : "wordchars" character-range # | "wordchars" lowercase-range uppercase-range # | "boundarychars" character-range # | "boundarychars" lowercase-range uppercase-range # string-stmt : "stringchar" string # | "stringchar" lowercase-string uppercase-string # # Characters described with the "boundarychars" statement are considered # part of a word only if they are embedded between characters declared with # the "wordchars" or "stringchar" statements. # # If two ranges or strings are given in a statement, the first describes # characters that are interpreted as lowercase and the second describes # uppercase. In the case of a string-stmt, the two strings must be of the # same length. Also, in a string-stmt, the actual strings may contain # both uppercase and characters themselves without difficulty; for instance # the statement # # stringchar \\*(sS \\*(Ss # # is legal and will not interfere with (or be interfered with by) other # declarations of of "s" and "S" as lower and upper case, respectively. # # A final note on string characters: some languages (e.g., German) collate # certain special characters as if they were strings. For example, "a-umlaut" # is traditionally sorted as if it were "ae". Ispell is not capable of this; # each character must be treated as an individual entity. So in certain cases, # ispell will sort a list of words into a different order than the standard # "dictionary" order for the subject language. # # A range is given as in egrep and the shell: [a-z] means lowercase # alphabetics; [^a-z] means all but lowercase, etc. If more than # one statement of a type is given, the characters given are combined # to produce the final list of characters. # # In addition to defining the character sets, the order of these statements # defines the collating sequence for sorts. If a range is given, the # characters defined by that statement will collate in their ASCII order. # # The character-declaration statements have a rather strange behavior caused # by the need to match each lowercase character with its uppercase equivalent. # In any given "wordchars" or "boundarychars" statement, the characters in # each range are first sorted into ASCII collating sequence, then matched # one-for-one with the other range. (The two ranges must have the same # number of characters). Thus, for example, the two statements: # # wordchars [aeiou] [AEIOU] # wordchars [aeiou] [UOIEA] # # would produce exactly the same effect. To get the vowels to match # up "wrong", you would have to use separate statements: # # wordchars a U # wordchars e O # wordchars i I # wordchars o I # wordchars u A # # which would cause myupper('e') to be 'O', and mylower('O') to be 'e'. # # This should normally be a problem only with languages which have been # forced to use a strange ASCII collating sequence. If your uppercase # and lowercase letters both collate in the same order, and your lowercase # characters are selected from the "lowercase specials" -- `~{}| -- # you shouldn't have to worry about this "feature". # # The prefixes and suffixes section have exactly the same syntax, except # that one is introduced with "prefixes" and one with "suffixes". I'll # describe them together. # # prefixes : "prefixes" table* # suffixes : "suffixes" table* # table : flagdef* # flagdef : "flag" [*] <character> ":" replacement* # # A prefix or suffix table consists of an introductory keyword and a list # of flag definitions. Flags can be defined more than once, in which case # the definitions are combined. Each flag controls one or more "replacements" # which are conditionally applied to the beginnings or endings of various # words. # # Flags are named by a single character. Depending on a configuration file # define, this character can be either an uppercase letter (the default # configuration) or any 7-bit ASCII character. Most languages should be # able to get along with just 26 flags, though. # # If an asterisk is placed before the flag character, it means that this # flag participates in cross-product formation. This only matters if the # file describes both prefixes and suffixes. If so, all prefixes and # suffixes marked with an asterisk will be applied in all combinations # to the root word. This can form a large number of words quickly, some # of which may be illegal, so watch out. If cross-products produce illegal # words, "munchlist" will not produce those flag combinations, and you # will just waste time (yours and ispell's). # # replacement : condition* ">" [ "-" strip-string "," ] append-string # # A replacement is a conditional rule for modifying a root word. Up to # 8 conditions may be specified. If the conditions are satisfied, the # rules on the right-hand side of the replacement are followed. If a # strip-string is given, it is first stripped from the beginning or ending # (as appropriate) of the root word. Then the append-string is added at # that point. For example, the condition "." means "any word", and the # condition "Y" means "any word ending in Y". The following (suffix) # replacements: # # . > MENT # Y > -Y,IES # # would change "induce" to "inducement" and "fly" to "flies". (If they # were controlled by the same flag, they would also change "fly" to "flyment", # which might not be what was wanted). # # No matter how much you might wish it, the strings on the right must be # strings of specific characters, not ranges. The reasons are rooted deeply # in the way ispell work, and it would be difficult or impossible to provide # for more flexibility. For example, you might wish to write: # # [EY] > -[EY],IES # # Too bad. You have to write it as two separate rules: # # E > -E,IES # Y > -Y,IES # # condition : { "." | character | range } # # A condition is a restriction on the characters that adjoin and/or are # replaced by the right-hand side of the replacement. Up to 8 conditions # may be given, which should be enough context for anyone. The right-hand # side will be applied only if the conditions in the replacement are # satisfied. The conditions also implicitly define a length; roots # shorter than the number of conditions will not pass the test. (As a # special case, a condition of a single dot "." defines a length of zero, # so that the rule applies to all words indiscriminately). # # Conditions that are single characters should be separated by white space. # # An example from below: # # flag *S: # [^AEIOU]Y > -Y,IES # As in imply > implies # [AEIOU]Y > S # As in convey > conveys # [SXZH] > ES # As in fix > fixes # [^SXZHY] > S # As in bat > bats # # The first line applies to words ending in Y, but not in vowel-Y. The # second takes care of the vowel-Y words. The third then handles those # words that end in a sibilant or near-sibilant, and the last picks up # everything else. # # Note that the conditions are written very carefully so that they apply # to disjoint sets of words. In particular, note that the fourth line # excludes words ending in Y as well as the obvious SXZH. Otherwise, it # would convert "imply" into "implys". # # Although we do not do so in this file, we could also have a flag # generate more than one variation on a root word. For example, we # could extend the "R" flag as follows: # # flag *R: # E > R # As in skate > skater # E > RS # As in skate > skaters # [^AEIOU]Y > -Y,IER # As in multiply > multiplier # [^AEIOU]Y > -Y,IERS # As in multiply > multipliers # [AEIOU]Y > ER # As in convey > conveyer # [AEIOU]Y > ERS # As in convey > conveyers # [^EY] > ER # As in build > builder # [^EY] > ERS # As in build > builders # # This flag would generate both "skater" and "skaters" from "skate". This # capability can be very useful in language that make use of noun, verb, # and adjective endings. For instance, one could define a single flag # that generated all of the German "weak" verb endings. # nroffchars ().\\* texchars ()\[]{}<\>\\$*.% # And now to the English table itself # First we declare the character set. Since it's English, it's easy. # The only special character is the apostrophe, so that possessives can # be handled. We declare it as a boundary character, so that quoting with # single quotes doesn't confuse things. The apostrophe is the only # character that gets such treatment. # # We declare the apostrophe first so that "Jon's" collates before "Jonas". # (This is the way ASCII does it). # boundarychars ' wordchars [a-z] [A-Z] # Here's a record of flags used, in case you want to add new ones. # Right now, we fit within the minimal MASKBITS definition. # # ABCDEFGHIJKLMNOPQRSTUVWXYZ # Used: * * **** ** * ***** *** # A D GHIJ MN P RSTUV XYZ # Available: -- -- -- - - - # BC EF KL O Q W # Now the prefix table. There are only three prefixes that are truly # frequent in English, and none of them seem to need conditional variations. # prefixes flag *A: . > RE # As in enter > reenter flag *I: . > IN # As in disposed > indisposed flag *U: . > UN # As in natural > unnatural # Finally, the suffixes. These are exactly the suffixes that came out # with the original "ispell"; I haven't tried to improve them. The only # thing I did besides translate them was to add selected cross-product flags. # suffixes flag V: E > -E,IVE # As in create > creative [^E] > IVE # As in prevent > preventive flag *N: E > -E,ION # As in create > creation Y > -Y,ICATION # As in multiply > multiplication [^EY] > EN # As in fall > fallen flag *X: E > -E,IONS # As in create > creations Y > -Y,ICATIONS # As in multiply > multiplications [^EY] > ENS # As in weak > weakens flag H: Y > -Y,IETH # As in twenty > twentieth [^Y] > TH # As in hundred > hundredth flag *Y: . > LY # As in quick > quickly flag *G: E > -E,ING # As in file > filing [^E] > ING # As in cross > crossing flag *J: E > -E,INGS # As in file > filings [^E] > INGS # As in cross > crossings flag *D: E > D # As in create > created [^AEIOU]Y > -Y,IED # As in imply > implied [^EY] > ED # As in cross > crossed [AEIOU]Y > ED # As in convey > conveyed flag T: E > ST # As in late > latest [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest [AEIOU]Y > EST # As in gray > grayest [^EY] > EST # As in small > smallest flag *R: E > R # As in skate > skater [^AEIOU]Y > -Y,IER # As in multiply > multiplier [AEIOU]Y > ER # As in convey > conveyer [^EY] > ER # As in build > builder flag *Z: E > RS # As in skate > skaters [^AEIOU]Y > -Y,IERS # As in multiply > multipliers [AEIOU]Y > ERS # As in convey > conveyers [^EY] > ERS # As in build > builders flag *S: [^AEIOU]Y > -Y,IES # As in imply > implies [AEIOU]Y > S # As in convey > conveys [SXZH] > ES # As in fix > fixes [^SXZHY] > S # As in bat > bats flag *P: [^AEIOU]Y > -Y,INESS # As in cloudy > cloudiness [AEIOU]Y > NESS # As in gray > grayness [^Y] > NESS # As in late > lateness flag *M: . > 'S # As in dog > dog's